33  Phi Coefficient of Correlation

The Phi Coefficient of Correlation measures the degree of association between two binary variables. It is a special case of the Pearson correlation coefficient: applying Pearson's formula to two dichotomous variables coded 0 and 1 yields phi.
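The equivalence with Pearson's correlation can be checked numerically. The following is a minimal sketch, assuming a small hypothetical 2x2 table (the counts 3, 1, 1, 3 are illustrative and do not come from the text) and a hand-written `pearson_r` helper introduced only for this example. It expands the table into 0/1 observations, computes the ordinary Pearson correlation, and compares it with the standard closed-form phi for a 2x2 table.

```python
import math

# Hypothetical 2x2 counts (illustrative only): a, b on the first row; c, d on the second.
a, b, c, d = 3, 1, 1, 3

# Expand the table into one (x, y) pair per individual, coding each category as 1 or 0.
pairs = [(1, 1)] * a + [(1, 0)] * b + [(0, 1)] * c + [(0, 0)] * d
x = [p[0] for p in pairs]
y = [p[1] for p in pairs]

def pearson_r(u, v):
    """Plain Pearson correlation of two equal-length numeric sequences."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    cov = sum((ui - mean_u) * (vi - mean_v) for ui, vi in zip(u, v))
    var_u = sum((ui - mean_u) ** 2 for ui in u)
    var_v = sum((vi - mean_v) ** 2 for vi in v)
    return cov / math.sqrt(var_u * var_v)

r = pearson_r(x, y)

# Closed-form phi for a 2x2 table, computed from the four cell counts.
phi = (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))

print(r, phi)  # 0.5 0.5
```

Both routes give the same number, which is why phi can be read on the same scale as any other correlation coefficient.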

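The phi coefficient ranges from -1 to 1, where -1 indicates a perfect negative association, 0 indicates no association, and 1 indicates a perfect positive association.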

The phi coefficient is often used in conjunction with the Chi-square test for 2x2 contingency tables to quantify the strength of the association between the variables. It provides a numeric measure of the relationship’s strength, whereas the chi-square test assesses the significance of that relationship.
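The link between the two statistics can be written compactly. For a 2x2 table with total sample size $n$, and assuming $\chi^2$ is the Pearson chi-square statistic computed without a continuity correction:

\[ \phi^2 = \frac{\chi^2}{n} \quad \Longrightarrow \quad |\phi| = \sqrt{\frac{\chi^2}{n}} \]

Because $\chi^2$ is never negative, this route recovers only the magnitude of $\phi$, not its sign.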

Application Contexts:

Each of these tests and measures has its specific conditions and assumptions that must be met to ensure valid and reliable results. They are powerful tools in the arsenal of statistical analysis for categorical data, providing insights into patterns, associations, and differences among groups or variables.

Let’s use an example problem to calculate the Phi Coefficient of Correlation for a 2x2 contingency table.

33.1 Example Problem:

Imagine a study looking at the relationship between having a gym membership (Yes or No) and being classified as physically active (Active or Not Active). Here’s the data collected from 200 individuals:

| | Active | Not Active | Total |
|---|---|---|---|
| Gym Member | 80 | 20 | 100 |
| No Gym Member | 30 | 70 | 100 |
| Total | 110 | 90 | 200 |

Step-by-Step Calculation:

First, let’s label the counts in our contingency table:

- $a = 80$ (Active and Gym Member)
- $b = 20$ (Not Active and Gym Member)
- $c = 30$ (Active and No Gym Member)
- $d = 70$ (Not Active and No Gym Member)

Phi Coefficient Formula: \[ \phi = \frac{ad - bc}{\sqrt{(a+b)(c+d)(a+c)(b+d)}} \]

Plugging in the values, we get:

\[ \phi = \frac{(80 \times 70) - (20 \times 30)}{\sqrt{(80+20)(30+70)(80+30)(20+70)}} \]

\[ \phi = \frac{5600 - 600}{\sqrt{100 \times 100 \times 110 \times 90}} \]

\[ \phi = \frac{5000}{\sqrt{10000 \times 9900}} \]

\[ \phi = \frac{5000}{\sqrt{99000000}} \]

\[ \phi \approx \frac{5000}{9949.87} \]

\[ \phi \approx 0.5025 \]
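The same arithmetic can be reproduced directly from the four cell counts. This is a minimal sketch; the function name `phi_coefficient` and its zero-margin check are choices made for this illustration, not part of any library. Unlike a chi-square-based calculation, the direct formula keeps the sign of the association.

```python
import math

def phi_coefficient(a, b, c, d):
    """Phi for a 2x2 table laid out as [[a, b], [c, d]]."""
    numerator = a * d - b * c
    denominator = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    if denominator == 0:
        # A row or column total of zero leaves phi undefined.
        raise ValueError("phi is undefined when a row or column total is zero")
    return numerator / denominator

# Gym membership example: a = 80, b = 20, c = 30, d = 70
phi = phi_coefficient(80, 20, 30, 70)
print(f"phi = {phi:.4f}")  # phi = 0.5025
```

Swapping the two rows of the table, `phi_coefficient(30, 70, 80, 20)`, flips the sign and returns about -0.5025, which shows that the sign depends on how the categories are ordered.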

Interpretation:

The calculated Phi Coefficient of approximately 0.50 suggests a moderate positive association between having a gym membership and being physically active. Individuals with gym memberships are more likely to be classified as active than those without: 80 of 100 members (80%) are active, compared with 30 of 100 non-members (30%). The positive sign shows that the association runs in the expected direction, and a value of 0.50 indicates a noticeable correlation but not an extremely strong one.

33.1.1 Phi-Coefficient of Correlation calculation using R:

# Create a matrix for the observed frequencies
observed_matrix <- matrix(c(80, 20, 30, 70),
      nrow = 2,  # Two rows for Gym Member and No Gym Member
      ncol = 2,  # Two columns for Active and Not Active
      byrow = TRUE,  # Fill matrix by rows
      dimnames = list(c("Gym Member", "No Gym Member"), c("Active", "Not Active")))

# Perform the Chi-square Test of Independence to get the chi-squared statistic.
# correct = FALSE turns off Yates's continuity correction, which chisq.test()
# applies to 2x2 tables by default and which would shrink the result to about 0.4925.
test_result <- chisq.test(observed_matrix, correct = FALSE)

# Calculate the Phi Coefficient (magnitude only: the square root discards the sign)
phi_coefficient <- sqrt(unname(test_result$statistic) / sum(observed_matrix))

# Print the result, rounded to four decimal places
print(paste("Phi Coefficient:", round(phi_coefficient, 4)))
[1] "Phi Coefficient: 0.5025"

33.1.2 Phi-Coefficient of Correlation calculation using Python:

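import numpy as np
import scipy.stats as stats

# Create an array for the observed frequencies
observed = np.array([[80, 20],
                     [30, 70]])

# Perform the Chi-square Test of Independence to get the chi-squared statistic and expected frequencies.
# correction=False disables Yates's continuity correction, which chi2_contingency()
# applies to 2x2 tables by default and which would shrink the result to about 0.4925.
chi2_stat, p_val, dof, expected = stats.chi2_contingency(observed, correction=False)

# Calculate the Phi Coefficient (magnitude only: the square root discards the sign)
phi_coefficient = np.sqrt(chi2_stat / observed.sum())

# Print the result, rounded to four decimal places
print(f"Phi Coefficient: {phi_coefficient:.4f}")
Phi Coefficient: 0.5025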

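In the Python code, the chi2_contingency() function from SciPy’s stats module computes the chi-square statistic, and the Phi Coefficient is then calculated as the square root of that statistic divided by the total sample size. Two caveats apply. First, chi2_contingency() applies Yates's continuity correction to 2x2 tables unless correction=False is passed (R's chisq.test() does the same unless correct = FALSE is passed); with the correction left on, this data gives about 0.4925 instead of the hand-calculated 0.5025. Second, a square root is never negative, so this approach yields only the magnitude of phi, and the direction of the association has to be read from the sign of $ad - bc$.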
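The effect of the continuity correction on this data can be reproduced without any statistics library. The sketch below uses the standard 2x2 shortcut formulas for the Pearson chi-square statistic, with and without Yates's correction; the variable names are illustrative, and the corrected shortcut is assumed to apply only when $|ad - bc|$ is larger than $n/2$, as it is here.

```python
import math

# Cell counts from the gym membership example
a, b, c, d = 80, 20, 30, 70
n = a + b + c + d
marginals = (a + b) * (c + d) * (a + c) * (b + d)

# Pearson chi-square without any continuity correction
chi2_plain = n * (a * d - b * c) ** 2 / marginals

# Pearson chi-square with Yates's continuity correction (2x2 shortcut form)
chi2_yates = n * (abs(a * d - b * c) - n / 2) ** 2 / marginals

phi_plain = math.sqrt(chi2_plain / n)
phi_yates = math.sqrt(chi2_yates / n)

print(f"uncorrected: {phi_plain:.4f}")  # uncorrected: 0.5025
print(f"corrected:   {phi_yates:.4f}")  # corrected:   0.4925
```

Only the uncorrected statistic satisfies $\phi^2 = \chi^2 / n$, so it is the one that agrees with the hand calculation.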